{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "2f01b288",
   "metadata": {},
   "source": [
    "# RAG personal bot\n",
    "\n",
    "Exercise for week 5 of LLM Engineering course.\n",
    "\n",
    "This notebook will create a personal RAG bot. It will use a the ./kb directory to store the files that we want to include in the RAG. Subdirectories will be used to denote categories for the files.\n",
    "**Important: only one level of subdirectories will be used for the categories**\n",
    "\n",
    "It uses LangChain to create and process the RAG pipeline and chat.\n",
    "The voector database persistent sotre is in the ./vdb folder. \n",
    "\n",
    "In this version we use chromadb for the vector store.\n",
    "The store is recreated each run. This is not efficient for large datasets. \n",
    "\n",
    "Future upgrades - To Do (in no particular order): \n",
    "- [X] Create a fully local version for security and privacy\n",
    "- [ ] Create persistent data store - only load, chunk and embed changed documents. \n",
    "- [ ] Provide selection of vector db engines (Chroma DB as default, or connect to external vector db e.g. ElasticSearch or AWS Opensearch)\n",
    "- [ ] Add an interface to upload documents in data store - including user-defined metadata tags\n",
    "- [ ] Add more document data types\n",
    "- [ ] Add online search capability - use web crawler tool to crawl a website and create website-specific RAG bot\n",
    "- [ ] Read e-mails/calendars/online docs (Amazon S3 bucket, Google Drive)\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6dfe8e48",
   "metadata": {},
   "outputs": [],
   "source": [
    "# These were necessary as langchain does not install them by default\n",
    "!pip install pypdf\n",
    "!pip install pdfminer.six\n",
    "!pip install python-docx\n",
    "!pip install docx2txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "193171c0",
   "metadata": {},
   "outputs": [],
   "source": [
    "# imports\n",
    "\n",
    "import os\n",
    "import glob\n",
    "from dotenv import load_dotenv\n",
    "import gradio as gr\n",
    "\n",
    "# imports for langchain, plotly and Chroma\n",
    "# plotly is commented out, as it is not used in the current code\n",
    "\n",
    "from langchain.document_loaders import DirectoryLoader, TextLoader, PDFMinerLoader, Docx2txtLoader\n",
    "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
    "# from langchain.schema import Document\n",
    "from langchain_openai import OpenAIEmbeddings, ChatOpenAI\n",
    "from langchain_chroma import Chroma\n",
    "#import matplotlib.pyplot as plt\n",
    "#from sklearn.manifold import TSNE\n",
    "#import numpy as np\n",
    "#import plotly.graph_objects as go\n",
    "from langchain.memory import ConversationBufferMemory\n",
    "from langchain.chains import ConversationalRetrievalChain\n",
    "# from langchain.embeddings import HuggingFaceEmbeddings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d22d2e48",
   "metadata": {},
   "outputs": [],
   "source": [
    "MODEL = \"gpt-4o-mini\"\n",
    "db_name = \"vdb\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fc23bf8c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load environment variables in a file called .env\n",
    "\n",
    "load_dotenv(override=True)\n",
    "os.environ['OPENAI_API_KEY'] = os.getenv('OPENAI_API_KEY', 'your-key-if-not-using-env')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0103ef35",
   "metadata": {},
   "source": [
    "## Loading the documents\n",
    "In the code below we read in the KB documents and create the vector store. \n",
    "We will be adding PDF documents, Word documents and text/markdown documents.\n",
    "Each document has its own loader, which we are calling separately through DirectoryLoader.\n",
    "At the end, we are combining the results, and then start splitting the documents using the Recursive Character Text Splitter."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2f20fd20",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Read in documents using LangChain's loaders\n",
    "# Take everything in all the sub-folders of our knowledgebase\n",
    "\n",
    "folders = glob.glob(\"kb/*\")\n",
    "print(f\"Found {len(folders)} folders in the knowledge base.\")\n",
    "\n",
    "def add_metadata(doc, doc_type):\n",
    "    doc.metadata[\"doc_type\"] = doc_type\n",
    "    return doc\n",
    "\n",
    "# For text files\n",
    "text_loader_kwargs = {'encoding': 'utf-8'}\n",
    "\n",
    "documents = []\n",
    "for folder in folders:\n",
    "    print(f\"Loading documents from folder: {folder}\")\n",
    "    doc_type = os.path.basename(folder)\n",
    "    # PDF Loader\n",
    "    pdf_loader = DirectoryLoader(folder, glob=\"**/*.pdf\", loader_cls=PDFMinerLoader)\n",
    "    # Text loaders\n",
    "    txt_loader = DirectoryLoader(folder, glob=\"**/*.txt\", loader_cls=TextLoader, loader_kwargs=text_loader_kwargs)\n",
    "    md_loader = DirectoryLoader(folder, glob=\"**/*.md\", loader_cls=TextLoader, loader_kwargs=text_loader_kwargs)\n",
    "    # Load MS Word documents - UnstructuredWordDocumentLoader does not play well with numpy > 1.24.0, and we use Docx2txtLoader instead. \n",
    "    # doc_loader = DirectoryLoader(folder, glob=\"**/*.doc\", loader_cls=UnstructuredWordDocumentLoader)\n",
    "    docx_loader = DirectoryLoader(folder, glob=\"**/*.docx\", loader_cls=Docx2txtLoader)\n",
    "    # document doc_type is used to identify the type of document\n",
    "    # Load documents from PDF, text and word files and combine the results\n",
    "    pdf_docs = pdf_loader.load()\n",
    "    print(f\"Loaded {len(pdf_docs)} PDF documents from {folder}\")\n",
    "    text_docs = txt_loader.load() + md_loader.load()\n",
    "    print(f\"Loaded {len(text_docs)} text documents from {folder}\")\n",
    "    word_docs = docx_loader.load()\n",
    "    print(f\"Loaded {len(word_docs)} Word documents from {folder}\")\n",
    "    folder_docs = pdf_docs + text_docs + word_docs\n",
    "    # Add metadata to each document\n",
    "    if not folder_docs:\n",
    "        print(f\"No documents found in folder: {folder}\")\n",
    "        continue\n",
    "    documents.extend([add_metadata(doc, doc_type) for doc in folder_docs])\n",
    "\n",
    "# Split the documents into chunks\n",
    "text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)\n",
    "chunks = text_splitter.split_documents(documents)\n",
    "\n",
    "# Print out some basic info for the loaded documents and chunks\n",
    "print(f\"Total number of documents: {len(documents)}\")\n",
    "print(f\"Total number of chunks: {len(chunks)}\")\n",
    "print(f\"Document types found: {set(doc.metadata['doc_type'] for doc in documents)}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "021cadc7",
   "metadata": {},
   "source": [
    "## Vector Store\n",
    "\n",
    "We use Chromadb for vector store\n",
    "Same code as the one in the lesson notebook, minus the visualization part"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "efc70e3a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# embeddings = OpenAIEmbeddings()\n",
    "\n",
    "# If you would rather use the free Vector Embeddings from HuggingFace sentence-transformers\n",
    "# Then replace embeddings = OpenAIEmbeddings()\n",
    "# with:\n",
    "from langchain.embeddings import HuggingFaceEmbeddings\n",
    "embeddings = HuggingFaceEmbeddings(model_name=\"sentence-transformers/all-MiniLM-L6-v2\")\n",
    "\n",
    "# Delete if already exists\n",
    "\n",
    "if os.path.exists(db_name):\n",
    "    Chroma(persist_directory=db_name, embedding_function=embeddings).delete_collection()\n",
    "\n",
    "# Create vectorstore\n",
    "\n",
    "vectorstore = Chroma.from_documents(documents=chunks, embedding=embeddings, persist_directory=db_name)\n",
    "print(f\"Vectorstore created with {vectorstore._collection.count()} documents\")\n",
    "\n",
    "# Let's investigate the vectors\n",
    "\n",
    "collection = vectorstore._collection\n",
    "count = collection.count()\n",
    "\n",
    "sample_embedding = collection.get(limit=1, include=[\"embeddings\"])[\"embeddings\"][0]\n",
    "dimensions = len(sample_embedding)\n",
    "print(f\"There are {count:,} vectors with {dimensions:,} dimensions in the vector store\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c9af1d32",
   "metadata": {},
   "source": [
    "## LangChain\n",
    "Create Langchain chat, memory and retrievers.\n",
    "\n",
    "Note: for this localized version, Gemma3 4B worked much better than Llama 3.2, with my documents. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2360006e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# create a new Chat with OpenAI\n",
    "#llm = ChatOpenAI(temperature=0.7, model_name=MODEL)\n",
    "\n",
    "# Alternative - if you'd like to use Ollama locally, uncomment this line instead\n",
    "llm = ChatOpenAI(temperature=0.7, model_name='gemma3:4b', base_url='http://localhost:11434/v1', api_key='ollama')\n",
    "\n",
    "# set up the conversation memory for the chat\n",
    "memory = ConversationBufferMemory(memory_key='chat_history', return_messages=True)\n",
    "\n",
    "# the retriever is an abstraction over the VectorStore that will be used during RAG\n",
    "retriever = vectorstore.as_retriever(search_kwargs={\"k\": 20})  # k is the number of documents to retrieve\n",
    "\n",
    "# putting it together: set up the conversation chain with the GPT 3.5 LLM, the vector store and memory\n",
    "conversation_chain = ConversationalRetrievalChain.from_llm(llm=llm, retriever=retriever, memory=memory)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "88a21bb3",
   "metadata": {},
   "source": [
    "## UI part\n",
    "Create Gradio interface\n",
    "\n",
    "Simple built-in chat interface\n",
    "\n",
    "To Do: Add upload interface to include additional documents in data store."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0dfe7d75",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Wrapping that in a function\n",
    "\n",
    "def chat(question, history):\n",
    "    result = conversation_chain.invoke({\"question\": question})\n",
    "    return result[\"answer\"]\n",
    "\n",
    "# And in Gradio:\n",
    "\n",
    "view = gr.ChatInterface(chat, type=\"messages\").launch(inbrowser=True)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "llms",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
